Abstract

This paper proposes a Hilbert space embedding for Dirichlet Process mixture models
via the stick-breaking construction of Sethuraman [6]. Although Bayesian nonparametrics
offers a powerful way to construct priors that avoid the need to specify the model
size or complexity explicitly, exact inference is often intractable. Frequentist
approaches such as kernel machines, on the other hand, suffer from model selection
and comparison problems but often benefit from efficient learning algorithms. This paper
discusses the possibility of combining the best of both worlds, using the Dirichlet Process
mixture model as a case study.



1 Dirichlet Process mixture models 

Much real-world data cannot be explained by simple probability models. Rather, such data often
come from heterogeneous sources of unknown properties, which call for more complex probability models.
Mixture modelling is a popular way of representing such heterogeneity, and also forms a basis for many
Bayesian probabilistic models. Unfortunately, a long-standing difficulty in mixture modelling is choosing
the number of mixture components, i.e., the number of sources from which the data are generated. The Dirichlet
Process mixture model (DPMM) allows an a priori unbounded number of components, whose number can
be inferred from the observed data.

As a basis for the DPMM, we first give a formal definition of the Dirichlet Process (DP), taken from [3].

Definition 1 (Dirichlet Process). A Dirichlet Process is a distribution of a random probability measure $G$
over a measurable space $(\Omega, \mathcal{B})$, such that for any finite partition $(A_1, \ldots, A_r)$ of $\Omega$, i.e., $\Omega = \biguplus_{i=1}^{r} A_i$,
where $\biguplus$ means disjoint union and $A_i \in \mathcal{B}$, we have

\[ (G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_r)) \]

where $G(A_i) = \int_{A_i} dG$ and $G_0(A_i) = \int_{A_i} dG_0$ for $i = 1, \ldots, r$.

Generally speaking, the DP is a distribution over probability measures. Each draw $G$ from a DP can be
interpreted as a random distribution whose sample path is a probability measure with probability one. The
base distribution $G_0$ can be thought of as the mean of the DP, whereas the strength parameter $\alpha$ can be
regarded as an inverse variance.
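
The roles of the base distribution and the strength parameter can be checked numerically. The following is a minimal sketch, not part of the original paper: it draws the finite-dimensional Dirichlet marginals of Definition 1 for a hypothetical three-set partition and compares the Monte Carlo variance of $G(A_1)$ with the standard Dirichlet value $G_0(A_1)(1 - G_0(A_1))/(\alpha + 1)$. The partition probabilities, function names, and sample sizes are illustrative assumptions.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one Dirichlet vector by normalizing independent Gamma(a_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def partition_moments(alpha, base_probs, n_draws=20000, seed=0):
    """Monte Carlo mean and variance of G(A_1) under Definition 1.

    base_probs[i] plays the role of G_0(A_i) for a finite partition
    (hypothetical values, assumed given).
    """
    rng = random.Random(seed)
    params = [alpha * p for p in base_probs]
    draws = [sample_dirichlet(params, rng)[0] for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    var = sum((d - mean) ** 2 for d in draws) / n_draws
    return mean, var

# Hypothetical values of G_0(A_1), G_0(A_2), G_0(A_3).
base_probs = [0.2, 0.3, 0.5]
for alpha in (1.0, 10.0, 100.0):
    mean, var = partition_moments(alpha, base_probs)
    exact_var = base_probs[0] * (1.0 - base_probs[0]) / (alpha + 1.0)
    print(alpha, round(mean, 3), round(var, 5), round(exact_var, 5))
```

The mean stays near $G_0(A_1)$ for every $\alpha$, while the variance shrinks as $\alpha$ grows, which is the sense in which $\alpha$ acts as an inverse variance.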
The DP has received much attention and has been studied extensively in the past few years, especially
in the Bayesian nonparametrics community. Several constructions have been proposed to show the existence of
the DP. For example, Blackwell and MacQueen used the Polya urn scheme to show that the distributions
sampled from a DP are discrete almost surely [1]. Equivalent to the extended Polya urn scheme is the Chinese
restaurant process (CRP), a random process in which $n$ customers sit in a Chinese restaurant with an
infinite number of tables. Moreover, one may view draws from a DP as weighted sums of point masses.
This view was made precise by the stick-breaking construction of Sethuraman [6].

In this paper, we resort to this constructive way of forming G. It can be described by the generative process: 

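\[ \beta_i \sim \mathrm{Beta}(1, \alpha), \qquad \pi_i = \beta_i \prod_{k=1}^{i-1} (1 - \beta_k), \qquad \theta_i \sim G_0, \qquad G = \sum_{i=1}^{\infty} \pi_i \delta_{\theta_i}. \]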
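
The generative process above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the infinite sum is cut off after a finite number of sticks, the base measure $G_0$ is taken to be a standard normal, and the names `stick_breaking_weights` and `sample_truncated_dp` are hypothetical.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Return pi_1..pi_T with beta_i ~ Beta(1, alpha) and
    pi_i = beta_i * prod_{k<i} (1 - beta_k), plus the leftover tail mass."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        beta = rng.betavariate(1.0, alpha)
        weights.append(beta * remaining)   # break off a fraction of what is left
        remaining *= 1.0 - beta            # length of the stick still unbroken
    return weights, remaining

def sample_truncated_dp(alpha, truncation, base_sampler, rng):
    """Draw atoms theta_i ~ G_0 and weights pi_i, so that G is approximated
    by sum_i pi_i * delta_{theta_i} over the first `truncation` atoms."""
    weights, tail = stick_breaking_weights(alpha, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]
    return atoms, weights, tail

def standard_normal(rng):
    """Assumed base measure G_0 = N(0, 1), chosen only for illustration."""
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
atoms, weights, tail = sample_truncated_dp(2.0, 50, standard_normal, rng)
print(sum(weights) + tail)  # weights plus the unbroken tail account for the whole stick
```

The returned `tail` is the mass that the truncation discards, which is the quantity a truncation bound has to control.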

The following theorem establishes the connection between the stick-breaking construction and the Dirichlet
Process given in Definition 1.

Theorem 1. The stick-breaking construction gives the same probability measure over all random measures
on the measurable space $(\Omega, \mathcal{B})$ as the Dirichlet Process with the same parameters $\alpha$ and $G_0$.

By means of the stick-breaking construction, we consider the Dirichlet Process mixture model (DPMM) of
the form $\sum_{i=1}^{\infty} \pi_i f_{\theta_i}(x)$, which is a mixture of distributions having the same parametric form $f$ but differing
in their parameters. As with many statistical models, exact inference in the DPMM is intractable, so
efficient approximate inference is needed. The most popular inference methods for the DPMM are Markov
chain Monte Carlo (MCMC), variational Bayes (VB), and collapsed variational methods. Unlike most
previous approaches in Bayesian nonparametrics, we study a new approach that employs the Hilbert space
embedding. This approach leads to kernel-based inference for the DPMM.

2 Hilbert space embedding for Dirichlet Process mixtures 

Let the base measure $G_0$ be a distribution over the parameter space $\Theta$, and let $f_\theta$, $\theta \sim G_0$,
denote the density function parametrized by $\theta$. Each draw from the DPMM defines the density function
$F_\theta(x) = \sum_{i=1}^{\infty} \pi_i f_{\theta_i}(x)$. We represent the probability distribution with density $F_\theta$ as $\mathbb{P}_{\pi,\theta}$ and
the set of all $\mathbb{P}_{\pi,\theta}$ by $\mathfrak{P}_{\alpha,\theta}$.

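Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) with a reproducing kernel $k$. Assume that $k(x, x)$ is
bounded for all $x$. Then, the Dirichlet Process Mixture Embedding (DPME) is defined as

\[ \mathcal{T} : \mathfrak{P}_{\alpha,\theta} \to \mathcal{H}, \qquad \mathbb{P}_{\pi,\theta} \mapsto \int k(x, \cdot) \, d\mathbb{P}_{\pi,\theta}(x) = \sum_{i=1}^{\infty} \pi_i \int k(x, \cdot) \, df_{\theta_i}(x). \]

We denote the embedding of $\mathbb{P}_{\pi,\theta}$ by $\mathcal{T}[\mathbb{P}_{\pi,\theta}]$. Since $\sum_{k=1}^{\infty} \pi_k = 1$ almost surely and $\|k(x, \cdot)\|_{\mathcal{H}} < \infty$
for all $x \in \mathcal{X}$, it follows that $\|\mathcal{T}[\mathbb{P}_{\pi,\theta}]\|_{\mathcal{H}} < \infty$. Therefore, the DPME is well-defined.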

Unfortunately, working directly with the DPME is cumbersome because of the infinite sum. Ishwaran and
James [4] made the important observation that truncating the stick-breaking representation at a
sufficiently large $T$ already provides an excellent approximation to the full DPMM. As a result, we
propose the truncated Dirichlet Process Mixture Embedding (tDPME):

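\[ \mathcal{T} : \mathfrak{P}_{\alpha,\theta,T} \to \mathcal{H}, \qquad \mathbb{P}_{\pi,\theta,T} \mapsto \int k(x, \cdot) \, d\mathbb{P}_{\pi,\theta,T}(x) = \sum_{i=1}^{T} \pi_i \int k(x, \cdot) \, df_{\theta_i}(x). \]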
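
For some kernel and component choices the integrals in the truncated embedding are available in closed form. The sketch below is an illustration under assumptions that are not in the paper: one-dimensional Gaussian components $f_{\theta_i} = N(\theta_i, s^2)$ with a shared standard deviation, and a Gaussian kernel $k(x, y) = \exp(-(x - y)^2 / (2 b^2))$. In that case $\int k(x, y) \, df_{\theta_i}(x)$ is a Gaussian convolution. The function names are hypothetical.

```python
import math

def gaussian_mean_embedding(y, mean, std, bandwidth):
    """Evaluate mu[f](y) = int k(x, y) f(x) dx for f = N(mean, std^2) and
    k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2)), using the closed form
    obtained by convolving the two Gaussians."""
    total_var = bandwidth ** 2 + std ** 2
    scale = math.sqrt(bandwidth ** 2 / total_var)
    return scale * math.exp(-((y - mean) ** 2) / (2.0 * total_var))

def truncated_dpme(y, weights, means, std, bandwidth):
    """Evaluate the truncated embedding sum_{i<=T} pi_i * mu[f_{theta_i}] at y."""
    return sum(w * gaussian_mean_embedding(y, m, std, bandwidth)
               for w, m in zip(weights, means))

# Hypothetical truncated mixture with T = 3 components.
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.0, 3.0]
for y in (-2.0, 0.0, 3.0):
    print(y, truncated_dpme(y, weights, means, std=0.5, bandwidth=1.0))
```

Because the Gaussian kernel is bounded by one, every evaluation of the truncated embedding is bounded by the sum of the weights.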
Figure 1: The hierarchical structure of the truncated DPME. The truncation level $T$ imposes a hierarchical
structure on the class of distributions $\mathfrak{P}_{\alpha,\theta,T}$. As $T$ increases, the set $\mathfrak{P}_{\alpha,\theta,T}$ enlarges, giving more
flexibility to the model.

Here $\mathfrak{P}_{\alpha,\theta,T}$ and $\mathbb{P}_{\pi,\theta,T}$ denote the truncated versions of $\mathfrak{P}_{\alpha,\theta}$ and $\mathbb{P}_{\pi,\theta}$, respectively, where $T > 0$ is the
truncation level. The following theorem presents the RKHS version of the almost-sure truncation known in
the nonparametric Bayesian literature.

Theorem 2 (Almost-sure truncation). Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) with a
reproducing kernel $k$. Assume that $\|k(x, \cdot)\|_{\mathcal{H}} < R$ for all $x$. The following inequality holds:

\[ \| \mathcal{T}[\mathbb{P}_{\pi,\theta}] - \mathcal{T}[\mathbb{P}_{\pi,\theta,T}] \|_{\mathcal{H}} < C \cdot \exp(-T/\alpha) \]

where $C$ is a constant.

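Proof. We have

\[ \| \mathcal{T}[\mathbb{P}_{\pi,\theta}] - \mathcal{T}[\mathbb{P}_{\pi,\theta,T}] \|_{\mathcal{H}} = \left\| \sum_{i=1}^{\infty} \pi_i \int k(x, \cdot) \, df_{\theta_i}(x) - \sum_{i=1}^{T} \pi_i \int k(x, \cdot) \, df_{\theta_i}(x) \right\|_{\mathcal{H}} \le \sum_{i=T+1}^{\infty} \pi_i \int \| k(x, \cdot) \|_{\mathcal{H}} \, df_{\theta_i}(x) < \sum_{i=T+1}^{\infty} \pi_i \int R \, df_{\theta_i}(x). \]

We can see that $\int R \, df_{\theta_i}(x)$ is finite for all $i$. Thus, letting $\int R \, df_{\theta_i}(x) < C$ for all $i$ with some constant $C$
yields

\[ \sum_{i=T+1}^{\infty} \pi_i \int R \, df_{\theta_i}(x) < \sum_{i=T+1}^{\infty} \pi_i C = C \left( 1 - \sum_{i=1}^{T} \pi_i \right) \approx C \cdot \exp(-T/\alpha). \]

The last step of the proof uses the fact that $\sum_{i=1}^{T} \pi_i = \sum_{i=1}^{T} \left( \exp(-T_{i-1}/\alpha) - \exp(-T_i/\alpha) \right) = 1 - \exp(-T_T/\alpha) \approx 1 - \exp(-T/\alpha)$,
where $T_0 = 0$, $T_T = E_1 + E_2 + \cdots + E_T$, and $E_i \sim \exp(1)$ (cf. [4]). ■

Theorem 2 asserts that the truncated DPME is close in RKHS norm to the true DPME for a sufficiently
large truncation level $T$. Consequently, by working with the truncated DPME instead of the true DPME, we
do not lose much information. Moreover, the bound also suggests how to choose $T$: for the error
to be smaller than $\delta$, it suffices to choose $T$ such that $T > \alpha \ln(C/\delta)$. The effect of setting different truncation
levels $T$ in the DPME can be seen in Figure 1.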
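
The rule for choosing the truncation level can be sketched as a small helper. This is an illustrative sketch rather than the authors' code: it solves $C \exp(-T/\alpha) \le \delta$ for the smallest integer $T$, and the function name and default value of `C` are assumptions.

```python
import math

def truncation_level(alpha, delta, C=1.0):
    """Smallest integer T >= 1 with C * exp(-T / alpha) <= delta,
    i.e. T >= alpha * ln(C / delta)."""
    if delta >= C:
        return 1  # the bound is already below delta at the first level
    return max(1, math.ceil(alpha * math.log(C / delta)))

# Larger alpha spreads mass over more sticks, so more components are needed.
for alpha in (0.5, 1.0, 5.0):
    T = truncation_level(alpha, delta=0.01)
    print(alpha, T, math.exp(-T / alpha))
```

The required level grows linearly in $\alpha$ and only logarithmically in $1/\delta$, so tightening the tolerance is cheap compared with increasing the strength parameter.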
2.1 Optimization



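Given observations $x_1, x_2, \ldots, x_m$, we would like to find $\mathbb{P}_{\pi,\theta,T}$ that is as close as possible to the underlying
distribution $\mathbb{P}$ of the observations. To accomplish this, we employ the usual Hilbert space embedding of $\mathbb{P}$
given by $\mu_{\mathbb{P}} = \mathbb{E}_{x \sim \mathbb{P}}[k(x, \cdot)]$. The empirical estimate of $\mu_{\mathbb{P}}$ can be computed from the observations as
$\hat{\mu}_X = \frac{1}{m} \sum_{k=1}^{m} k(x_k, \cdot)$. The optimization problem can then be cast as follows:

\[ \min_{\pi} \; \| \hat{\mu}_X - \mathcal{T}[\mathbb{P}_{\pi,\theta,T}] \|_{\mathcal{H}}^2 \quad \text{subject to} \quad \pi^\top \mathbf{1} = 1, \; \pi_i \ge 0. \tag{1} \]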

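To prevent overfitting, we introduce a regularizer $\Omega(\pi) = \frac{1}{2} \|\pi\|^2$ with a regularization constant $\varepsilon > 0$.
Substituting $\hat{\mu}_X$ and $\mathcal{T}[\mathbb{P}_{\pi,\theta,T}]$ back into (1) yields a quadratic program (QP) for $\pi$:

\[ \min_{\pi} \; \frac{1}{2} \pi^\top (S + \varepsilon I) \pi - R^\top \pi \quad \text{subject to} \quad \pi^\top \mathbf{1} = 1, \; \pi_i \ge 0, \]

where $I$ is the identity matrix, and $S \in \mathbb{R}^{T \times T}$ and $R \in \mathbb{R}^{T}$ are given by $S_{ij} = \langle \mu[f_{\theta_i}], \mu[f_{\theta_j}] \rangle_{\mathcal{H}}$ and
$R_j = \langle \hat{\mu}_X, \mu[f_{\theta_j}] \rangle_{\mathcal{H}}$, respectively, with $\mu[f_{\theta_i}] = \int k(x, \cdot) \, df_{\theta_i}(x)$. Note that our optimization problem is similar
to the one in [7]. Due to space constraints, we therefore refer the reader to [7] for how to compute $S$ and
$R$ and for the details of the optimization.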
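
To make the structure of this QP concrete, the following sketch solves a small instance. It is not the procedure of [7]: it assumes one-dimensional Gaussian components with fixed, known means and a shared standard deviation, a Gaussian kernel (for which $S$ and $R$ have closed forms), and a plain projected-gradient solver over the simplex swapped in for a dedicated QP solver. All names and the synthetic data are hypothetical.

```python
import math
import random

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_i >= 0, sum_i p_i = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # keep the threshold of the last index that passes
    return [max(x - theta, 0.0) for x in v]

def component_gram(means, std, bandwidth):
    """S_ij = <mu[f_i], mu[f_j]> for f_i = N(means[i], std^2), Gaussian kernel."""
    var = bandwidth ** 2 + 2.0 * std ** 2
    scale = math.sqrt(bandwidth ** 2 / var)
    return [[scale * math.exp(-((a - b) ** 2) / (2.0 * var)) for b in means]
            for a in means]

def data_cross_terms(data, means, std, bandwidth):
    """R_j = <mu_hat_X, mu[f_j]> = (1/m) * sum_n mu[f_j](x_n)."""
    var = bandwidth ** 2 + std ** 2
    scale = math.sqrt(bandwidth ** 2 / var)
    return [sum(scale * math.exp(-((x - m) ** 2) / (2.0 * var)) for x in data) / len(data)
            for m in means]

def solve_weights(S, R, eps=1e-3, n_iter=2000):
    """Projected gradient descent on 0.5 * pi^T (S + eps I) pi - R^T pi over the simplex."""
    T = len(R)
    A = [[S[i][j] + (eps if i == j else 0.0) for j in range(T)] for i in range(T)]
    # A row-sum bound on the largest eigenvalue of A gives a safe step size.
    step = 1.0 / max(sum(abs(a) for a in row) for row in A)
    pi = [1.0 / T] * T
    for _ in range(n_iter):
        grad = [sum(A[i][j] * pi[j] for j in range(T)) - R[i] for i in range(T)]
        pi = project_to_simplex([p - step * g for p, g in zip(pi, grad)])
    return pi

# Synthetic data: 70% of points near -3 and 30% near 3 (hypothetical example).
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(1400)] + [rng.gauss(3.0, 1.0) for _ in range(600)]
means = [-3.0, 0.0, 3.0]  # assumed candidate component means theta_i
S = component_gram(means, std=1.0, bandwidth=1.0)
R = data_cross_terms(data, means, std=1.0, bandwidth=1.0)
weights = solve_weights(S, R)
print(weights)
```

In this toy setting the solver places most of the mass on the two components that generated the data and little on the spurious middle component, without ever assigning individual observations to components.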

The optimization problem we use here is conceptually similar to the variational methods for the DPMM [2].
That is, we minimize the distance between the approximate quantity $\mathbb{P}_{\pi,\theta,T}$ and the true quantity $\mathbb{P}$.
In contrast to MCMC and VB, which require access to the latent variables associated with the observations in
order to perform inference, our approach does not require access to the latent variables at any point during
inference. The values of the latent variables are instead computed in a postprocessing step.
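
The paper does not spell out the postprocessing step. One plausible reading, offered here purely as an assumption, is to compute the usual mixture responsibilities $r_{ni} \propto \pi_i f_{\theta_i}(x_n)$ from the fitted weights and assign each observation to its most responsible component. The sketch below does this for one-dimensional Gaussian components; the function names and the example weights are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, std):
    """Posterior probability of each component given x (assumed postprocessing rule)."""
    joint = [w * gaussian_pdf(x, m, std) for w, m in zip(weights, means)]
    total = sum(joint)
    return [j / total for j in joint]

def hard_assignment(x, weights, means, std):
    """Index of the component with the largest responsibility for x."""
    r = responsibilities(x, weights, means, std)
    return max(range(len(r)), key=lambda i: r[i])

# Hypothetical fitted weights and component means.
weights = [0.7, 0.0, 0.3]
means = [-3.0, 0.0, 3.0]
for x in (-2.5, 0.4, 3.1):
    print(x, hard_assignment(x, weights, means, std=1.0))
```

A component whose weight has been driven to zero never receives an observation under this rule, so the number of occupied components is read off from the optimized weights.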

3 Discussions 

We are investigating some open questions related to the proposed kernel-based inference for the DPMM. For
example, it is vital to understand how the solution of the above optimization problem relates to the solutions
of standard inference methods for the DPMM, such as maximum likelihood and MAP. Is there a kernel
$k$ for which these solutions coincide? What is the effect of choosing a different kernel $k$? And what is the
connection between our approach and the basic k-means algorithm? The answers to these questions will be of
mutual benefit to researchers in kernel methods and Bayesian nonparametrics.